Ayman BEN HAJJAJ & Jules RUBIN
This study is the final project of the Machine Learning II course at EFREI Paris (Master 1 Data Science & AI, 2023). It aims to detect DDoS attacks in network traffic using machine learning, based on the CIC-DDoS2019 dataset, which contains 78 features, around 430K rows, and 18 types of attacks. The taxonomy of the attacks present in the dataset is described in the research paper DDoS Evaluation Dataset (CIC-DDoS2019).
Taxonomy of attacks present in the dataset:

Since we can choose which attacks to detect, we will focus on the detection of three attack types:
We will also provide a method that can distinguish benign traffic from malicious traffic.
# making the necessary imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
warnings.filterwarnings('ignore')
# load the three traffic captures into a single dataframe
df = pd.concat([
    pd.read_parquet('data/Syn-training.parquet'),
    pd.read_parquet('data/DNS-testing.parquet'),
    pd.read_parquet('data/UDP-training.parquet'),
])
df['Label'].value_counts()
Syn          43302
Benign       32901
UDP          14792
DrDoS_DNS     3669
MSSQL          145
Name: Label, dtype: int64
# We can remove the MSSQL data as it is not required for our analysis
df = df[df['Label'] != 'MSSQL']
df['Label'].value_counts()
Syn          43302
Benign       32901
UDP          14792
DrDoS_DNS     3669
Name: Label, dtype: int64
# count the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)
Protocol 0
CWE Flag Count 0
Fwd Avg Packets/Bulk 0
Fwd Avg Bytes/Bulk 0
Avg Bwd Segment Size 0
..
Bwd IAT Total 0
Fwd IAT Min 0
Fwd IAT Max 0
Fwd IAT Std 0
Label 0
Length: 78, dtype: int64
# check for duplicate rows
df.duplicated().sum()
194
# remove duplicate rows
df.drop_duplicates(inplace=True)
The dataset has no missing values, but it does contain some duplicate rows, which we have removed.
# count the number of unique values in each column
df.nunique().sort_values(ascending=True).head(25)
Bwd Avg Bulk Rate           1
Bwd Avg Packets/Bulk        1
Bwd Avg Bytes/Bulk          1
Fwd Avg Bulk Rate           1
Fwd Avg Packets/Bulk        1
Fwd Avg Bytes/Bulk          1
ECE Flag Count              1
Fwd URG Flags               1
Bwd PSH Flags               1
Bwd URG Flags               1
PSH Flag Count              1
FIN Flag Count              1
URG Flag Count              2
Fwd PSH Flags               2
RST Flag Count              2
ACK Flag Count              2
CWE Flag Count              2
SYN Flag Count              2
Protocol                    3
Label                       4
Down/Up Ratio              15
Bwd IAT Min               105
Bwd Packet Length Min     195
Fwd Act Data Packets      212
Total Fwd Packets         263
dtype: int64
Since some columns have only one unique value, they carry no information and can be dropped. Some remaining low-cardinality columns are effectively categorical; we convert them to numerical indicator variables using one-hot encoding.
# drop columns with a single unique value (no information)
one_value_cols = [col for col in df.columns if df[col].nunique() <= 1]
df = df.drop(one_value_cols, axis=1)
# treat remaining low-cardinality columns (<= 3 values) as categorical;
# note that 'Label' has 4 unique values, so it is not affected
three_value_cols = [col for col in df.columns if df[col].nunique() <= 3]
# One Hot Encoding
df = pd.get_dummies(df, columns=three_value_cols)
df['Label'].value_counts()
Syn          43302
Benign       32707
UDP          14792
DrDoS_DNS     3669
Name: Label, dtype: int64
# As the dataset is imbalanced, we balance it by sampling 3,500 rows from each class
balanced = pd.DataFrame()
balanced = pd.concat([balanced, df[df['Label'] == 'Syn'].sample(n=3500)])
balanced = pd.concat([balanced, df[df['Label'] == 'DrDoS_DNS'].sample(n=3500)])
balanced = pd.concat([balanced, df[df['Label'] == 'UDP'].sample(n=3500)])
balanced = pd.concat([balanced, df[df['Label'] == 'Benign'].sample(n=3500)])
df = balanced.copy()
# free up memory
del balanced
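As an aside, the four `sample`/`concat` calls above can be written more concisely with `groupby(...).sample` (available since pandas 1.1). A minimal sketch on a hypothetical toy frame (labels and counts are made up):

```python
import pandas as pd

# toy frame standing in for df; labels and counts are made up
toy = pd.DataFrame({
    'Label': ['Syn'] * 10 + ['Benign'] * 7 + ['UDP'] * 5,
    'Flow Duration': range(22),
})

# one call draws the same number of rows from every class;
# random_state makes the draw reproducible
balanced = toy.groupby('Label').sample(n=5, random_state=42)
```

Passing `random_state` also makes the balanced subset reproducible across runs, which the per-class `sample` calls above do not guarantee.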
df.describe()
| Flow Duration | Total Fwd Packets | Total Backward Packets | Fwd Packets Length Total | Bwd Packets Length Total | Fwd Packet Length Max | Fwd Packet Length Min | Fwd Packet Length Mean | Fwd Packet Length Std | Bwd Packet Length Max | ... | SYN Flag Count_0 | SYN Flag Count_1 | RST Flag Count_0 | RST Flag Count_1 | ACK Flag Count_0 | ACK Flag Count_1 | URG Flag Count_0 | URG Flag Count_1 | CWE Flag Count_0 | CWE Flag Count_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.400000e+04 | 14000.000000 | 14000.000000 | 14000.000000 | 1.400000e+04 | 14000.000000 | 14000.000000 | 14000.000000 | 14000.000000 | 14000.000000 | ... | 14000.000000 | 14000.000000 | 14000.000000 | 14000.000000 | 14000.000000 | 14000.000000 | 14000.000000 | 14000.000000 | 14000.000000 | 14000.000000 |
| mean | 1.388123e+07 | 7.376929 | 3.636143 | 1642.728516 | 1.945576e+03 | 397.850128 | 340.514069 | 354.947144 | 20.607582 | 96.551712 | ... | 0.999357 | 0.000643 | 0.969000 | 0.031000 | 0.692571 | 0.307429 | 0.899000 | 0.101000 | 0.946000 | 0.054000 |
| std | 2.722049e+07 | 21.201320 | 26.343032 | 6166.372559 | 3.647773e+04 | 513.794617 | 486.089508 | 482.955658 | 73.789528 | 443.775818 | ... | 0.025347 | 0.025347 | 0.173324 | 0.173324 | 0.461445 | 0.461445 | 0.301339 | 0.301339 | 0.226026 | 0.226026 |
| min | 1.000000e+00 | 1.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 4.900000e+01 | 2.000000 | 0.000000 | 60.000000 | 0.000000e+00 | 6.000000 | 6.000000 | 6.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
| 50% | 1.054630e+05 | 4.000000 | 0.000000 | 750.000000 | 0.000000e+00 | 357.000000 | 47.000000 | 128.485291 | 0.000000 | 0.000000 | ... | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
| 75% | 7.107133e+06 | 6.000000 | 2.000000 | 2088.000000 | 2.400000e+01 | 426.000000 | 330.000000 | 359.500000 | 22.516661 | 6.000000 | ... | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
| max | 1.199662e+08 | 1308.000000 | 2025.000000 | 130324.000000 | 3.187726e+06 | 3619.000000 | 1729.000000 | 1729.000000 | 1199.664673 | 3607.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 73 columns
# check for outliers with boxplots
fig, ax = plt.subplots(4, 2, figsize=(15, 15))
# columns to inspect for outliers (eight, one per subplot; the original
# ninth column, 'Flow IAT Std', never fit in the 4x2 grid)
outliers_col = ['Flow Duration', 'Total Fwd Packets', 'Total Backward Packets', 'Fwd Packet Length Max',
                'Bwd Packet Length Max', 'Flow Bytes/s', 'Flow Packets/s', 'Flow IAT Mean']
# one boxplot per column
for axis, col in zip(ax.flat, outliers_col):
    sns.boxplot(x=df[col], ax=axis)
plt.show()
# remove the outliers using IsolationForest
from sklearn.ensemble import IsolationForest
# create an instance of the IsolationForest class
iso = IsolationForest(n_estimators=1000, max_samples='auto', contamination=0.05, max_features=1.0,
                      bootstrap=False, n_jobs=-1, random_state=42, verbose=0)
# fit the model
yhat = iso.fit_predict(df.drop('Label', axis=1))
# select all rows that are not outliers
mask = yhat != -1
df = df[mask]
df.shape
(13300, 74)
df['Label'].value_counts()
UDP          3500
Syn          3495
DrDoS_DNS    3489
Benign       2816
Name: Label, dtype: int64
We can see that only a few outliers were found among the attack classes, while around 20% of the benign traffic was flagged as outliers. Those rows have been removed.
# move the Label column to the front
df = pd.concat([df['Label'], df.drop('Label', axis=1)], axis=1)
df
| Label | Flow Duration | Total Fwd Packets | Total Backward Packets | Fwd Packets Length Total | Bwd Packets Length Total | Fwd Packet Length Max | Fwd Packet Length Min | Fwd Packet Length Mean | Fwd Packet Length Std | ... | SYN Flag Count_0 | SYN Flag Count_1 | RST Flag Count_0 | RST Flag Count_1 | ACK Flag Count_0 | ACK Flag Count_1 | URG Flag Count_0 | URG Flag Count_1 | CWE Flag Count_0 | CWE Flag Count_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18047 | Syn | 35741973 | 6 | 0 | 36.0 | 0.0 | 6.0 | 6.0 | 6.0 | 0.0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 33377 | Syn | 67742929 | 8 | 6 | 48.0 | 36.0 | 6.0 | 6.0 | 6.0 | 0.0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 5914 | Syn | 60932490 | 12 | 6 | 72.0 | 36.0 | 6.0 | 6.0 | 6.0 | 0.0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 35902 | Syn | 23141 | 4 | 0 | 24.0 | 0.0 | 6.0 | 6.0 | 6.0 | 0.0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 19425 | Syn | 62915480 | 10 | 4 | 60.0 | 24.0 | 6.0 | 6.0 | 6.0 | 0.0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 52023 | Benign | 128 | 1 | 2 | 6.0 | 12.0 | 6.0 | 6.0 | 6.0 | 0.0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 52136 | Benign | 20878 | 2 | 2 | 62.0 | 164.0 | 31.0 | 31.0 | 31.0 | 0.0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 64240 | Benign | 22653 | 2 | 2 | 80.0 | 112.0 | 40.0 | 40.0 | 40.0 | 0.0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 61858 | Benign | 21517 | 2 | 2 | 82.0 | 114.0 | 41.0 | 41.0 | 41.0 | 0.0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 51244 | Benign | 25 | 2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
13300 rows × 74 columns
# helper: plot the distribution of a feature for each class
def plot_feature_distributions(feature):
    fig, ax = plt.subplots(1, 4, figsize=(20, 5))
    classes = [('Syn', 'r'), ('DrDoS_DNS', 'b'), ('UDP', 'g'), ('Benign', 'y')]
    for axis, (label, color) in zip(ax, classes):
        # sns.distplot is deprecated; histplot with a KDE is the modern equivalent
        sns.histplot(df[df['Label'] == label][feature], ax=axis, color=color, kde=True, stat='density')
        axis.set_title(label)
    plt.show()

# Plot the distributions of several features for each class
for feature in ['Flow Duration', 'Total Fwd Packets', 'Total Backward Packets',
                'Flow Bytes/s', 'Flow Packets/s']:
    plot_feature_distributions(feature)
We can see that the distribution of the Syn attack differs markedly from the other classes: the number of packets sent and the flow duration can be very high. The DNS attack also has characteristics that are clearly different from benign traffic; in particular, its Flow Bytes/s and Flow Packets/s values are very high.
These visualizations give us a better understanding of the characteristics of each attack.
# Plot the correlation matrix
corr = df.drop('Label', axis=1).corr()
plt.figure(figsize=(10, 10))
sns.heatmap(corr, annot=False, cmap='coolwarm')
plt.show()
We can see high correlation between some features, e.g. the Packet Length Min/Max/Mean family with the Fwd Packet Length Min/Max/Mean family, and the Idle Mean/Std/Min/Max features with the Flow IAT Mean/Std/Min/Max features. The one-hot encoding has also created pairs of perfectly anti-correlated columns, visible in the last rows and columns of the matrix.
These correlations will be handled by PCA after scaling.
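Before delegating everything to PCA, it can also be useful to list the strongly correlated pairs explicitly. A minimal sketch on synthetic data (the 0.95 threshold and the column names are our own choices, not from the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic stand-in for the feature matrix: column b is a near copy of a
a = rng.normal(size=500)
frame = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.01, size=500),
    'c': rng.normal(size=500),
})

corr = frame.corr().abs()
# keep the upper triangle only, so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# NaN comparisons are False, so the masked lower triangle drops out
pairs = [(r, c) for r in upper.index for c in upper.columns if upper.loc[r, c] > 0.95]
```

On the real frame this would flag the Min/Max/Mean families mentioned above and could be used to drop one column of each pair as a lighter alternative to PCA.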
# we standardize the features
from sklearn.preprocessing import StandardScaler
df.reset_index(inplace=True, drop=True)
# separate the features from the labels
X = df.drop('Label', axis=1)
y = df['Label']
# standardize the features
X = StandardScaler().fit_transform(X)
# find the optimal number of components
from sklearn.decomposition import PCA
pca_full = PCA().fit(X)
plt.plot(np.cumsum(pca_full.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.title('Explained variance vs number of components')
plt.show()
We see that reaching 80% of the explained variance requires keeping 12 components.
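Rather than reading the threshold off the curve, scikit-learn can select the number of components for a target variance fraction directly: passing a float to `PCA(n_components=...)` keeps just enough components. A sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# synthetic correlated data standing in for the scaled feature matrix
base = rng.normal(size=(1000, 5))
X_demo = np.hstack([base, base @ rng.normal(size=(5, 15)) * 0.1])

# a float n_components keeps just enough components to reach that
# fraction of explained variance
k = PCA(n_components=0.80).fit(X_demo).n_components_

# equivalent manual computation from the full decomposition
ratios = PCA().fit(X_demo).explained_variance_ratio_
k_manual = int(np.argmax(np.cumsum(ratios) >= 0.80)) + 1
```

On our scaled matrix `X`, `PCA(n_components=0.80)` should return the same 12 components chosen from the curve.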
pca = PCA(n_components=12)
principalComponents = pca.fit_transform(X)
# create a dataframe with the principal components
df_pca = pd.DataFrame(data=principalComponents, columns=['PC' + str(i) for i in range(1, 13)])
# concatenate the labels to the dataframe
df_pca = pd.concat([df_pca, df[['Label']]], axis=1)
# print the first 5 rows of the dataframe
df_pca.head()
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | Label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.529118 | -2.239074 | -0.567663 | -0.364401 | -0.536970 | -2.043625 | -3.338835 | 1.708432 | -0.003668 | -0.179944 | -0.229499 | 0.252435 | Syn |
| 1 | 13.271947 | -4.654444 | 2.070758 | -1.376058 | 1.596436 | 0.542698 | -0.000572 | -3.511027 | -0.000522 | 0.039492 | -0.336194 | 1.412558 | Syn |
| 2 | 9.349077 | -2.372076 | 1.610862 | -0.378344 | 1.370053 | 2.068493 | 3.131144 | -5.385500 | -0.020796 | 0.647247 | 0.683067 | -0.194844 | Syn |
| 3 | 0.626163 | 1.110381 | -1.516829 | 0.262022 | -1.220903 | -0.593366 | -0.045021 | 0.416983 | -0.047997 | 0.426737 | 0.889352 | -1.653700 | Syn |
| 4 | 10.200878 | -3.234997 | 1.478627 | -0.666336 | 1.391035 | 1.613825 | 2.646139 | -5.604995 | -0.019685 | 0.680888 | 0.643817 | 0.224534 | Syn |
To plot the PCA, we keep only the first 2 components.
# plot
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('PC1', fontsize=15)
ax.set_ylabel('PC2', fontsize=15)
ax.set_title('2 component PCA', fontsize=20)
targets = ['Syn', 'DrDoS_DNS', 'UDP', 'Benign']
colors = ['r', 'b', 'g', 'y']
for target, color in zip(targets, colors):
    indicesToKeep = df_pca['Label'] == target
    ax.scatter(df_pca.loc[indicesToKeep, 'PC1'], df_pca.loc[indicesToKeep, 'PC2'], c=color, s=50, alpha=0.5)
ax.legend(targets)
ax.grid()
plt.show()
# plot
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('PC1', fontsize=15)
ax.set_ylabel('PC2', fontsize=15)
ax.set_title('2 component PCA', fontsize=20)
targets = ['Syn', 'DrDoS_DNS', 'UDP', 'Benign']
colors = ['r', 'b', 'g', 'y']
for target, color in zip(targets, colors):
    indicesToKeep = df_pca['Label'] == target
    ax.scatter(df_pca.loc[indicesToKeep, 'PC1'], df_pca.loc[indicesToKeep, 'PC2'], c=color, s=50, alpha=0.5)
# overlay the loading vectors (biplot)
for i, txt in enumerate(df.drop('Label', axis=1).columns):
    plt.arrow(0, 0, 200*pca.components_[0][i], 200*pca.components_[1][i], color='black', alpha=0.2, head_width=0.3, width=.1)
    plt.annotate(txt, (200*pca.components_[0][i], 200*pca.components_[1][i]), size=7, alpha=0.7)
ax.legend(targets)
ax.grid()
plt.show()
# we do the same thing for a kernel PCA
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=73, kernel='rbf')
principalComponents = kpca.fit_transform(X)
explained_variance = np.var(principalComponents, axis=0)
explained_variance_ratio = explained_variance / np.sum(explained_variance)
# create a dataframe with the principal components
df_kpca = pd.DataFrame(data=principalComponents, columns=['PC' + str(i) for i in range(1, 74)])
# concatenate the labels to the dataframe
df_kpca = pd.concat([df_kpca, df[['Label']]], axis=1)
# print the first 5 rows of the dataframe
df_kpca.head()
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | ... | PC65 | PC66 | PC67 | PC68 | PC69 | PC70 | PC71 | PC72 | PC73 | Label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.643188 | -0.162278 | -0.353535 | -0.133856 | -0.289133 | -0.189921 | 0.140646 | -0.155415 | 0.017090 | 0.008395 | ... | 0.001296 | -0.010271 | 0.007927 | 0.006383 | 0.002042 | -0.006655 | 0.000597 | 0.005836 | -0.008529 | Syn |
| 1 | 0.569217 | 0.109413 | 0.117501 | 0.135406 | 0.295919 | 0.129899 | 0.020609 | -0.147531 | 0.083969 | 0.015308 | ... | 0.009547 | 0.016978 | -0.005995 | -0.003099 | 0.010069 | 0.010761 | -0.000090 | 0.009510 | 0.056592 | Syn |
| 2 | 0.622935 | 0.056317 | 0.066517 | 0.256754 | 0.533097 | 0.313556 | -0.174812 | 0.003289 | -0.010004 | 0.016072 | ... | 0.031583 | 0.014555 | -0.019246 | -0.047806 | 0.033876 | 0.045592 | -0.002529 | -0.010599 | 0.023279 | Syn |
| 3 | -0.054082 | -0.320709 | -0.010921 | -0.156224 | -0.113300 | 0.022851 | -0.054726 | 0.465741 | 0.074795 | -0.011173 | ... | 0.009587 | 0.010038 | -0.005111 | -0.004021 | -0.002758 | 0.029301 | -0.001466 | -0.002936 | 0.004350 | Syn |
| 4 | 0.593460 | 0.087865 | 0.110891 | 0.223374 | 0.484786 | 0.270257 | -0.111076 | -0.037728 | 0.029528 | 0.020607 | ... | -0.024004 | -0.005242 | 0.003324 | 0.020095 | -0.021699 | -0.039757 | 0.001898 | 0.003171 | -0.022911 | Syn |
5 rows × 74 columns
# plot cumulative explained variance
plt.plot(np.cumsum(explained_variance_ratio))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.title('Explained variance vs number of components')
plt.show()
# check if the cached kernel PCA results exist
if not os.path.exists('data/results/df_kpca.parquet'):
    # recompute the kernel PCA with 8 components
    kpca = KernelPCA(n_components=8, kernel='rbf')
    # create a dataframe with the principal components
    df_kpca = pd.DataFrame(data=kpca.fit_transform(X), columns=['PC' + str(i) for i in range(1, 9)])
    # concatenate the labels to the dataframe
    df_kpca = pd.concat([df_kpca, df[['Label']]], axis=1)
    # save the dataframe
    df_kpca.to_parquet('data/results/df_kpca.parquet')
else:
    df_kpca = pd.read_parquet('data/results/df_kpca.parquet')
# print the first 5 rows of the dataframe
df_kpca.head()
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | Label | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.694293 | -0.163515 | -0.390344 | -0.117720 | -0.268498 | -0.149944 | 0.094854 | -0.175512 | Syn |
| 1 | 0.648484 | -0.166958 | -0.382211 | -0.151421 | -0.295766 | -0.163081 | 0.129898 | -0.162038 | Syn |
| 2 | 0.598730 | 0.057647 | 0.043366 | 0.223421 | 0.495209 | 0.239183 | -0.173741 | -0.007366 | Syn |
| 3 | 0.681838 | -0.041536 | -0.119951 | 0.216507 | 0.471789 | 0.255171 | -0.282316 | 0.032216 | Syn |
| 4 | 0.511454 | -0.253601 | -0.345547 | -0.142021 | -0.278727 | -0.107336 | 0.028023 | 0.083457 | Syn |
# plot
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('PC1', fontsize=15)
ax.set_ylabel('PC2', fontsize=15)
ax.set_title('2 component K-PCA', fontsize=20)
targets = ['Syn', 'DrDoS_DNS', 'UDP', 'Benign']
colors = ['r', 'b', 'g', 'y']
for target, color in zip(targets, colors):
    indicesToKeep = df_kpca['Label'] == target
    ax.scatter(df_kpca.loc[indicesToKeep, 'PC1'], df_kpca.loc[indicesToKeep, 'PC2'], c=color, s=50, alpha=0.5)
ax.legend(targets)
ax.grid()
plt.show()
Both PCA and kernel PCA did a good job at reducing the dimensionality of the dataset to the point where the clusters are visually apparent, but the UDP attacks remain hard to distinguish from the DrDoS_DNS attacks.
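To go beyond visual inspection, the separation can be quantified, e.g. with the silhouette score of the true labels in the 2-D embedding. A sketch on synthetic blobs standing in for our classes (the data and parameters here are illustrative, not from the dataset):

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# synthetic labelled data standing in for the scaled features and Label
X_demo, y_demo = make_blobs(n_samples=600, centers=4, n_features=10,
                            cluster_std=1.5, random_state=0)
emb = PCA(n_components=2).fit_transform(X_demo)

# silhouette in [-1, 1]: higher means the classes form tighter,
# better-separated clusters in the embedding
score = silhouette_score(emb, y_demo)
```

Comparing this score between the PCA and kernel PCA embeddings would give a number to back the visual impression that UDP and DrDoS_DNS overlap.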
from sklearn.manifold import TSNE
# scatter plot the data
def plot_tsne(df_tsne):
    sns.scatterplot(x='PC1', y='PC2', hue='Label', data=df_tsne)
    plt.show()
tsne = TSNE(n_components=2, verbose=1, random_state=123, perplexity=5)
df_tsne = pd.DataFrame(tsne.fit_transform(df.drop('Label', axis=1)), columns=['PC1', 'PC2'])
df_tsne['Label'] = df['Label'].values
plot_tsne(df_tsne)
[t-SNE] Computing 16 nearest neighbors...
[t-SNE] Indexed 13300 samples in 0.009s...
[t-SNE] Computed neighbors for 13300 samples in 0.674s...
[t-SNE] Computed conditional probabilities for sample 13300 / 13300
[t-SNE] Mean sigma: 0.000000
[t-SNE] KL divergence after 250 iterations with early exaggeration: 84.957932
[t-SNE] KL divergence after 1000 iterations: 1.058402
We can see that the t-SNE algorithm is able to separate the different classes, but not cleanly enough. We need to adjust the perplexity parameter to obtain a better separation.
According to this article, the optimal perplexity depends on the number of samples in the dataset. As we have around 13k samples, we can try a perplexity of 100.
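One way to compare perplexity settings without eyeballing every plot is to sweep a few values on a (sub)sample and record the final KL divergence. A sketch on synthetic data (note that KL values are only roughly comparable across perplexities, so visual inspection of the embeddings remains the usual criterion):

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# synthetic stand-in; on the real data a subsample keeps this affordable
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)

results = {}
for perp in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perp, random_state=123)
    tsne.fit_transform(X_demo)
    # final KL divergence reached by the optimisation for this perplexity
    results[perp] = tsne.kl_divergence_
```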
To save time and avoid re-running the t-SNE algorithm, we load the precomputed results when they are available.
# check if the file 'data/results/df_tsne.parquet' exists
if os.path.isfile('data/results/df_tsne.parquet'):
    df_tsne = pd.read_parquet('data/results/df_tsne.parquet')
else:
    tsne = TSNE(n_components=2, verbose=1, random_state=123, perplexity=100)
    df_tsne = pd.DataFrame(tsne.fit_transform(df.drop('Label', axis=1)), columns=['PC1', 'PC2'])
    df_tsne['Label'] = df['Label'].values
plot_tsne(df_tsne)
With a perplexity of 100 the result looks very good: the classes are well separated. The main remaining difficulty is separating the UDP attacks from benign traffic, and the UDP attacks also overlap with the DrDoS_DNS attacks. The Syn attacks are clearly separated from all the other classes.
Without the benign traffic, we can see that the attack clusters are well separated.
df_tsne_no_benign = df_tsne[df_tsne['Label'] != 'Benign']
plot_tsne(df_tsne_no_benign)
# save the results if they are not already cached
if not os.path.isfile('data/results/df_tsne.parquet'):
    df_tsne.to_parquet('data/results/df_tsne.parquet')
if not os.path.isfile('data/results/df_tsne_no_benign.parquet'):
    df_tsne_no_benign.to_parquet('data/results/df_tsne_no_benign.parquet')
# copy the dataframe to a new one
df_lda = df.copy()
# perform LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_lda.drop('Label', axis=1), df_lda['Label'], test_size=0.2, random_state=42)
# perform LDA
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
# make predictions
y_pred = lda.predict(X_test)
# calculate accuracy
accuracy_score(y_test, y_pred)
0.9793233082706767
The overall accuracy is very high; we can also compute the precision, recall, and F1 score for each class.
# calculate precision, recall and F1 for each class
from sklearn.metrics import precision_score, recall_score, f1_score
print('Precision: ', precision_score(y_test, y_pred, average=None))
print('Recall: ', recall_score(y_test, y_pred, average=None))
print('F1: ', f1_score(y_test, y_pred, average=None))
Precision:  [0.99113475 0.94257703 0.99438202 0.99253731]
Recall:  [0.98589065 0.99409158 0.99159664 0.94729345]
F1:  [0.98850575 0.96764917 0.99298738 0.96938776]
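scikit-learn's `classification_report` bundles the same per-class precision, recall, and F1 with the class names attached, which is easier to read than the bare arrays above. A sketch on toy labels (the values below are made up, not our test set):

```python
from sklearn.metrics import classification_report

# toy true/predicted labels standing in for y_test / y_pred
y_true = ['Syn', 'Syn', 'Benign', 'UDP', 'Benign', 'UDP']
y_hat = ['Syn', 'Benign', 'Benign', 'UDP', 'Benign', 'Syn']
print(classification_report(y_true, y_hat))

# the dict form is convenient for programmatic checks
report = classification_report(y_true, y_hat, output_dict=True)
```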
# plot the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
from sklearn.metrics import roc_curve, auc
def plot_ROC(X_train, y_train, X_test, y_test, model):
    for cluster in y_test.unique():
        # binarize the labels: current class vs. the rest
        y_test_temp = y_test == cluster
        y_train_temp = y_train == cluster
        # fit the model on the binary problem
        model.fit(X_train, y_train_temp)
        # predict probabilities
        y_pred_temp = model.predict_proba(X_test)
        # fpr and tpr over all classification thresholds
        fpr, tpr, threshold = roc_curve(y_test_temp, y_pred_temp[:, 1])
        roc_auc = auc(fpr, tpr)
        print('AUC for class {}: {}'.format(cluster, roc_auc))
        # plot the ROC curve
        plt.plot(fpr, tpr, label='{} {} (AUC = {})'.format(cluster, 'traffic' if cluster == 'Benign' else 'attack', round(roc_auc, 3)))
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([-0.05, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc='lower right')
    plt.show()
plot_ROC(X_train, y_train, X_test, y_test, lda)
AUC for class Syn: 0.9933055956195428
AUC for class UDP: 0.9955267890661149
AUC for class DrDoS_DNS: 0.994199216233107
AUC for class Benign: 0.9937462660029948
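As an alternative to the per-class refitting in `plot_ROC` (which also mutates the fitted model), `roc_auc_score` can compute a macro-averaged one-vs-rest AUC directly from the multi-class probabilities of a single fitted model. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic 4-class problem standing in for our traffic classes
X_demo, y_demo = make_classification(n_samples=800, n_classes=4,
                                     n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
# macro-averaged one-vs-rest AUC from a single fitted model
auc_ovr = roc_auc_score(y_te, clf.predict_proba(X_te), multi_class='ovr')
```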
# perform QDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
df_qda = df.copy()
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_qda.drop('Label', axis=1), df_qda['Label'], test_size=0.2, random_state=42)
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
# make predictions
y_pred = qda.predict(X_test)
# calculate accuracy
accuracy_score(y_test, y_pred)
0.956015037593985
# calculate precision, recall and F1 for each class
from sklearn.metrics import precision_score, recall_score, f1_score
print('Precision: ', precision_score(y_test, y_pred, average=None))
print('Recall: ', recall_score(y_test, y_pred, average=None))
print('F1: ', f1_score(y_test, y_pred, average=None))
Precision:  [0.98951049 0.99131944 1.         0.86832298]
Recall:  [0.99823633 0.84342688 0.99019608 0.9957265 ]
F1:  [0.99385426 0.91141261 0.99507389 0.92767087]
Both the overall accuracy and the per-class metrics are lower than with LDA.
# plot the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
plot_ROC(X_train, y_train, X_test, y_test, qda)
AUC for class Syn: 0.9957983193277311
AUC for class UDP: 0.939628931201965
AUC for class DrDoS_DNS: 0.9879328799969609
AUC for class Benign: 0.9981878791402601
# perform LDA on the pca data
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_pca.drop('Label', axis=1), df_pca['Label'], test_size=0.2, random_state=42)
# perform LDA
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
# make predictions
y_pred = lda.predict(X_test)
accuracy_PCA_LDA = accuracy_score(y_test, y_pred)
# calculate precision, recall and F1 for each class
print('Precision: ', precision_score(y_test, y_pred, average=None))
print('Recall: ', recall_score(y_test, y_pred, average=None))
print('F1: ', f1_score(y_test, y_pred, average=None))
Precision:  [0.99065421 0.97247706 0.95673877 0.70377937]
Recall:  [0.93474427 0.78286558 0.80532213 0.98148148]
F1:  [0.96188748 0.86743044 0.87452471 0.81975015]
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
plot_ROC(X_train, y_train, X_test, y_test, lda)
AUC for class Syn: 0.9920622925429164
AUC for class UDP: 0.9870219771905165
AUC for class DrDoS_DNS: 0.9826624536030409
AUC for class Benign: 0.9761167442326862
# perform QDA on the pca data
# perform QDA
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
# make predictions
y_pred = qda.predict(X_test)
accuracy_PCA_QDA = accuracy_score(y_test, y_pred)
# calculate precision, recall and F1 for each class
print('Precision: ', precision_score(y_test, y_pred, average=None))
print('Recall: ', recall_score(y_test, y_pred, average=None))
print('F1: ', f1_score(y_test, y_pred, average=None))
Precision:  [0.98090278 0.98425197 0.99858557 0.93261456]
Recall:  [0.99647266 0.92319055 0.98879552 0.98575499]
F1:  [0.98862642 0.9527439  0.99366643 0.95844875]
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
plot_ROC(X_train, y_train, X_test, y_test, qda)
AUC for class Syn: 0.994422229323384
AUC for class UDP: 0.9864403179009921
AUC for class DrDoS_DNS: 0.9926811427413665
AUC for class Benign: 0.9969837309381823
# perform LDA on the kpca data
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_kpca.drop('Label', axis=1), df_kpca['Label'], test_size=0.2, random_state=42)
# perform LDA
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
# make predictions
y_pred = lda.predict(X_test)
accuracy_KPCA_LDA = accuracy_score(y_test, y_pred)
# calculate precision, recall and F1 for each class
print('Precision: ', precision_score(y_test, y_pred, average=None))
print('Recall: ', recall_score(y_test, y_pred, average=None))
print('F1: ', f1_score(y_test, y_pred, average=None))
Precision:  [0.84090909 0.96028881 0.98666667 0.85603113]
Recall:  [0.97883598 0.78698225 0.93277311 0.93883357]
F1:  [0.90464548 0.86504065 0.95896328 0.89552239]
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
# perform QDA on the kpca data
# perform QDA
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
# make predictions
y_pred = qda.predict(X_test)
accuracy_KPCA_QDA = accuracy_score(y_test, y_pred)
# calculate precision, recall and F1 for each class
print('Precision: ', precision_score(y_test, y_pred, average=None))
print('Recall: ', recall_score(y_test, y_pred, average=None))
print('F1: ', f1_score(y_test, y_pred, average=None))
Precision:  [0.96428571 0.94971264 0.96433471 0.98962963]
Recall:  [0.95238095 0.97781065 0.98459384 0.95021337]
F1:  [0.95829636 0.96355685 0.97435897 0.96952104]
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
# perform LDA on the t-SNE data
# copy the dataframe to a new one
df_tsne_lda = df_tsne.copy()
# map each label to a number
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df_tsne_lda['Label'])
df_tsne_lda['Label'] = le.transform(df_tsne_lda['Label'])
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_tsne_lda.drop('Label', axis=1), df_tsne_lda['Label'], test_size=0.2, random_state=42)
# perform LDA
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
# make predictions
y_pred = lda.predict(X_test)
accuracy_tsne_LDA = accuracy_score(y_test, y_pred)
# calculate precision, recall and F1 for each class
print('Precision: ', precision_score(y_test, y_pred, average=None))
print('Recall: ', recall_score(y_test, y_pred, average=None))
print('F1: ', f1_score(y_test, y_pred, average=None))
Precision: [0.68235294 0.83310902 0.81342282 0.74038462]
Recall: [0.4084507  0.91432792 0.84992987 0.87749288]
F1: [0.51101322 0.87183099 0.83127572 0.80312907]
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
<Axes: >
# perform QDA on the t-SNE data
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
# make predictions
y_pred = qda.predict(X_test)
accuracy_tsne_QDA = accuracy_score(y_test, y_pred)
# calculate the accuracy for each class
print('Precision: ', precision_score(y_test, y_pred, average=None))
print('Recall: ', recall_score(y_test, y_pred, average=None))
print('F1: ', f1_score(y_test, y_pred, average=None))
Precision: [0.77468354 0.88841202 0.84246575 0.76315789]
Recall: [0.53873239 0.91728213 0.86255259 0.90883191]
F1: [0.63551402 0.90261628 0.85239085 0.82964889]
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
<Axes: >
The accuracy is poor, as expected: t-SNE is a visualization technique, and its embedding is not meant to be used as input features for classification.
from sklearn.cluster import KMeans
First, we can try to apply the k-means algorithm to the dataset without any dimensionality reduction.
kmeans = KMeans(n_clusters=4, random_state=123)
kmeans.fit(df.drop('Label', axis=1))
df['kmeans'] = kmeans.labels_
# plot the points on 2 axes, Total Fwd Packets and Total Backward Packets, colored by cluster
plt.scatter(df['Total Fwd Packets'], df['Total Backward Packets'], c=df['kmeans'])
plt.xlabel('Total Fwd Packets')
plt.ylabel('Total Backward Packets')
plt.show()
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
for i in range(4):
    df[df['kmeans'] == i]['Label'].value_counts().plot.pie(ax=ax[i//2][i%2], autopct='%.2f', fontsize=12)
    ax[i//2][i%2].set_title('Cluster {}'.format(i))
plt.show()
df['kmeans'].value_counts(), df['Label'].value_counts()
(2    11431
 0      875
 1      753
 3      241
 Name: kmeans, dtype: int64,
 UDP          3500
 Syn          3495
 DrDoS_DNS    3489
 Benign       2816
 Name: Label, dtype: int64)
The k-means algorithm performs poorly on the raw data. We will apply it to the t-SNE results to see whether we obtain better clusters.
df.drop('kmeans', axis=1, inplace=True)
# elbow method to choose the number of clusters
distortions = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k, random_state=123)
    kmeanModel.fit(df_tsne.drop('Label', axis=1))
    distortions.append(kmeanModel.inertia_)
plt.figure(figsize=(16,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method to find the optimal k')
plt.show()
According to the elbow method, the optimal number of clusters is 4, which is coherent with the number of classes in the dataset.
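Reading the elbow off the plot is subjective; it can also be located numerically. The sketch below (an addition for illustration, not part of the original analysis) uses a common "knee" heuristic: the elbow is the point farthest from the straight line joining the first and last points of the distortion curve. The function name and the sample distortion values are invented for the example.

```python
import numpy as np

def find_elbow(ks, distortions):
    """Elbow = point farthest from the chord joining the first and
    last points of the distortion curve (a common knee heuristic)."""
    ks = np.asarray(list(ks), dtype=float)
    d = np.asarray(distortions, dtype=float)
    # normalise both axes to [0, 1] so they are comparable
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (d - d.min()) / (d.max() - d.min())
    # perpendicular distance of each point to the chord
    dy = y[-1] - y[0]
    dist = np.abs(dy * x - (y - y[0])) / np.hypot(dy, 1.0)
    return int(ks[np.argmax(dist)])

# illustrative distortion curve with a clear bend at k = 4
print(find_elbow(range(1, 10), [100, 50, 25, 12, 10, 9, 8.5, 8, 7.8]))  # → 4
```

On our real `distortions` list the same call would return the k to use, which should agree with the visual reading of 4.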
# create kmeans object
N = 4
kmeans = KMeans(n_clusters=N, random_state=123)
# fit kmeans object to data
kmeans.fit(df_tsne[['PC1', 'PC2']])
# print location of clusters learned by kmeans object
print(kmeans.cluster_centers_)
[[ 46.65798 -9.370605 ] [ 5.5882826 43.945114 ] [-46.13855 3.4791222] [ -4.4430566 -40.495094 ]]
# plot the data and the clusters learned
df_tsne['kmeans'] = kmeans.labels_
fig = plt.figure(figsize=(10, 10))
plt.subplot(221)
sns.scatterplot(x='PC1', y='PC2', hue='Label', data=df_tsne)
plt.title('Actual Labels')
plt.subplot(222)
sns.scatterplot(x='PC1', y='PC2', hue='kmeans', data=df_tsne)
plt.title('KMeans with {} Clusters'.format(N))
plt.show()
We can see that k-means is not able to separate the data into the correct clusters. This is because k-means always produces spherical clusters of roughly equal variance, an assumption that does not hold for this embedding.
We will try to use the k-means algorithm with the t-SNE results without the Benign traffic.
# create a plot grid of 2x2
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
# plot the pie charts of distributions of labels for each cluster
for i in range(4):
    df_tsne[df_tsne['kmeans'] == i]['Label'].value_counts().plot.pie(ax=ax[i//2][i%2], autopct='%.2f', fontsize=12)
    ax[i//2][i%2].set_title('Cluster {}'.format(i))
plt.show()
We can see that 3 of the 4 clusters are almost entirely composed of a single attack class, but we still struggle to separate the Benign traffic from the attacks, as it is mixed across all 4 clusters.
# evaluate the performance of the clustering using the silhouette score
from sklearn.metrics import silhouette_score
print(silhouette_score(df_tsne[['PC1', 'PC2']], kmeans.labels_))
# plot the silhouette for the various clusters
from yellowbrick.cluster import SilhouetteVisualizer
visualizer = SilhouetteVisualizer(kmeans, colors='yellowbrick')
visualizer.fit(df_tsne[['PC1', 'PC2']])
visualizer.show()
0.42762336
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 13300 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
The silhouette plot is a graphical tool showing how well each data point fits into the cluster it has been assigned to, compared with how well it would fit into the neighbouring clusters. The silhouette coefficient is a measure of cluster cohesion and separation.
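Concretely, each point gets a coefficient s = (b - a) / max(a, b), where a is its mean distance to its own cluster and b its mean distance to the nearest other cluster; the global score is the mean of these. A minimal sketch on synthetic blobs (standing in for the t-SNE embedding, which is an assumption of the example) shows the relation between the per-point and global values:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# toy stand-in for the 2-D embedding: four well-separated blobs
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=0.7, random_state=123)
labels = KMeans(n_clusters=4, n_init=10, random_state=123).fit_predict(X)

# per-point coefficient: s = (b - a) / max(a, b), always in [-1, 1]
s = silhouette_samples(X, labels)

# the global silhouette score is the mean of the per-point coefficients
print(np.isclose(s.mean(), silhouette_score(X, labels)))  # → True
```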
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=123)
gmm.fit(df_tsne[['PC1', 'PC2']])
df_tsne['gmm'] = gmm.predict(df_tsne[['PC1', 'PC2']])
# plot the results
fig = plt.figure(figsize=(10, 10))
plt.subplot(221)
sns.scatterplot(x='PC1', y='PC2', hue='Label', data=df_tsne, palette='Set1')
plt.title('Original labels')
plt.subplot(222)
sns.scatterplot(x='PC1', y='PC2', hue='gmm', data=df_tsne)
plt.title('GMM with {} components'.format(df_tsne['gmm'].nunique()))
plt.show()
# create a plot grid of 2x2
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
# plot the pie charts of distributions of labels for each cluster
for i in range(4):
df_tsne[df_tsne['gmm'] == i]['Label'].value_counts().plot.pie(ax=ax[i//2][i%2], autopct='%.2f', fontsize=12)
ax[i//2][i%2].set_title('Cluster {}'.format(i))
plt.show()
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.7, min_samples=15).fit(df_tsne[['PC1', 'PC2']])
df_tsne['dbscan'] = dbscan.labels_
# plot the data and the clusters learned
fig = plt.figure(figsize=(10, 10))
plt.subplot(221)
sns.scatterplot(x='PC1', y='PC2', hue='Label', data=df_tsne)
plt.title('Actual Labels')
plt.subplot(222)
sns.scatterplot(x='PC1', y='PC2', hue='dbscan', data=df_tsne)
plt.title('DBSCAN')
plt.show()
We can see that the DBSCAN algorithm creates more than 100 clusters, which is not what we want. We need to search for a better eps parameter in order to obtain a realistic number of clusters.
EPS = [0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9]
fig, ax = plt.subplots(4, 2, figsize=(20, 20))
for i, eps in enumerate(EPS):
    dbscan = DBSCAN(eps=eps, min_samples=20).fit(df_tsne[['PC1', 'PC2']])
    df_tsne['dbscan'] = dbscan.labels_
    # plot the clusters learned for this eps
    sns.scatterplot(x='PC1', y='PC2', hue='dbscan', data=df_tsne, ax=ax[i//2][i%2])
    ax[i//2][i%2].set_title('DBSCAN with eps={}'.format(eps))
plt.show()
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree
def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))
For the hierarchical clustering, we will first use a subsample of the dataset in order to have a better visualization of the dendrogram.
samples = df_tsne.sample(100)
# compute the distance matrix between all the points
points = samples[['PC1', 'PC2']]
n = points.shape[0]
distance_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        distance_matrix[i, j] = euclidean_distance(points.iloc[i], points.iloc[j])
sns.heatmap(distance_matrix)
<Axes: >
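The double Python loop above is O(n²) calls and quickly becomes slow; the same matrix can be obtained in one vectorised call with SciPy (a drop-in alternative, sketched here on random stand-in points rather than our sample):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 2))  # stand-in for samples[['PC1', 'PC2']].values

# one vectorised call replaces the 100 x 100 double Python loop
distance_matrix = cdist(pts, pts, metric='euclidean')

print(distance_matrix.shape)                      # → (100, 100)
print(np.allclose(np.diag(distance_matrix), 0))   # → True
```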
# compute the linkage matrix
Z = linkage(samples[['PC1', 'PC2']], method='average', metric='euclidean')
# plot the dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(Z, leaf_rotation=90., leaf_font_size=8., labels=samples['Label'].values)
plt.show()
We can see that the hierarchical clustering is able to separate the different types of attacks; however, the Benign traffic is mixed among the different clusters.
We can try to run the hierarchical clustering algorithm after separating the Benign traffic from the attacks with the LDA model.
samples_no_benign = df_tsne_no_benign.sample(100)
Z_no_benign = linkage(samples_no_benign[['PC1', 'PC2']], method='average', metric='euclidean')
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(Z_no_benign, leaf_rotation=90., leaf_font_size=8., labels=samples_no_benign['Label'].values)
plt.show()
We can see that without the Benign class, hierarchical clustering with Euclidean distance performs very well on the 3 classes: we clearly see 3 clusters (orange, green and red), one per class. However, we still have one big cluster containing a mix of UDP, DrDoS_DNS and a little Syn.
Z_full = linkage(df_tsne_no_benign[['PC1', 'PC2']], method='average', metric='euclidean')
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(Z_full, leaf_rotation=90., leaf_font_size=8., no_labels=True)
plt.show()
We can see that we have 4 main clusters (orange, green, red and blue) whereas we should get only 3 classes. Let's see the proportions of each class in each cluster.
df_tsne_no_benign['hierarchical'] = cut_tree(Z_full, n_clusters=4)
# plot the pie charts of distributions of labels for each cluster
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
for i in range(4):
    df_tsne_no_benign[df_tsne_no_benign['hierarchical'] == i]['Label'].value_counts().plot.pie(ax=ax[i // 2, i % 2], title='Cluster {}'.format(i), autopct='%.2f')
plt.show()
We can see that 3 of the clusters are almost entirely composed of a single class. The DrDoS_DNS and UDP attacks cannot be fully separated and end up sharing a cluster, whereas the Syn attacks are well separated from the other classes.
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
for cluster in range(df_tsne_no_benign['hierarchical'].nunique()):
    sns.scatterplot(x='PC1', y='PC2', hue='Label', data=df_tsne_no_benign[df_tsne_no_benign['hierarchical'] == cluster], ax=ax[cluster // 2, cluster % 2])
    ax[cluster // 2, cluster % 2].set_title('Cluster {}'.format(cluster))
# set the title of the plot
plt.suptitle('Hierarchical Clustering')
plt.show()
X_train, X_test, y_train, y_test = train_test_split(df.drop('Label', axis=1), df['Label'], test_size=0.2, random_state=42)
from sklearn.tree import DecisionTreeClassifier
# train a decision tree classifier on the training set
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
# predict the labels of the test set
y_pred = clf.predict(X_test)
# compute the accuracy of the predictions
accuracy_score(y_test, y_pred)
0.9890977443609023
# plot confusion matrix
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
<Axes: >
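One advantage of the decision tree over the projection-based models is that it exposes which flow features drive the classification, via `feature_importances_`. A self-contained sketch on synthetic data (the `make_classification` setup and feature names are invented for illustration, not taken from our dataset):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in: 10 features, only 3 of them informative
X, y = make_classification(n_samples=2000, n_features=10, n_informative=3,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# impurity-based importances are non-negative and sum to 1;
# the informative features should dominate the ranking
importances = pd.Series(clf.feature_importances_,
                        index=[f'f{i}' for i in range(10)]).sort_values(ascending=False)
print(importances.head(3))
```

On our real `df` the same two lines, with `X_train.columns` as the index, would rank the CIC-DDoS2019 flow features by how much they contribute to the splits.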
# train a knn classifier on the training set
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
# predict the labels of the test set
y_pred = clf.predict(X_test)
# compute the accuracy of the predictions
accuracy_score(y_test, y_pred)
0.9657894736842105
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
<Axes: >
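KNN is purely distance-based, so feeding it raw flow features lets large-scale columns (byte counts) drown out small-scale ones (flag counts). A hedged sketch on synthetic data, where one noise feature is deliberately blown up, shows why a scaling step usually belongs in front of KNN; the data setup is invented for the example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# with shuffle=False the last columns are pure noise; blowing one up mimics
# raw flow features whose scales differ wildly
X, y = make_classification(n_samples=2000, n_features=8, n_informative=2,
                           n_redundant=2, shuffle=False, random_state=42)
X[:, -1] *= 1000  # a noise feature now dominates every distance

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

raw = KNeighborsClassifier().fit(X_train, y_train).score(X_test, y_test)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(
    X_train, y_train).score(X_test, y_test)

print(raw, scaled)  # scaling recovers the lost accuracy
```

The same `make_pipeline(StandardScaler(), KNeighborsClassifier())` pattern could be dropped into our training cell above to check whether the 0.966 KNN accuracy improves.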
# train a random forest classifier on the training set
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=123)
clf.fit(X_train, y_train)
# predict the labels of the test set
y_pred = clf.predict(X_test)
# compute the accuracy of the predictions
accuracy_score(y_test, y_pred)
0.9890977443609023
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
<Axes: >
accuracy_LDA = 0.9763246899661782
accuracy_QDA = 0.9676813228109733
accuracy_DT = 0.9898496240601504
accuracy_KNN = 0.9616541353383459
accuracy_RF = 0.9902255639097745
x=['LDA', 'QDA', 'PCA/LDA', 'PCA/QDA', 'KPCA/LDA', 'KPCA/QDA', 't-SNE/LDA', 't-SNE/QDA', 'DT', 'KNN', 'RF']
y=[accuracy_LDA, accuracy_QDA, accuracy_PCA_LDA, accuracy_PCA_QDA, accuracy_KPCA_LDA, accuracy_KPCA_QDA, accuracy_tsne_LDA, accuracy_tsne_QDA, accuracy_DT, accuracy_KNN, accuracy_RF]
accuracies = {label: accuracy for label, accuracy in zip(x, y)}
accuracies = {k: v for k, v in sorted(accuracies.items(), key=lambda item: item[1])}
plt.figure(figsize=(15, 7))
sns.barplot(x=list(accuracies.keys()), y=list(accuracies.values()))
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.ylim(0.75, 1)
plt.show()
According to our study, we can draw some conclusions: